HIVE-28952: TableFetcher to return Table objects instead of names #6020
Conversation
Hi @vikramahuja1001 @deniskuzZ @abstractdog, can we have this reviewed?
Force-pushed from d9bbf83 to 185b976
Force-pushed from 185b976 to d17ca25
@deniskuzZ @okumin all comments have been addressed and all checks have passed.
+1 non-binding
okumin left a comment:
+1
Merged. @Neer393 Thanks for your contribution! @vikramahuja1001 @deniskuzZ Thanks for your review!
List<String> databases = client.getDatabases(catalogName, dbPattern);
...
for (String db : databases) {
  List<String> tablesNames = getTableNamesForDatabase(catalogName, db);
@Neer393 I don't understand what you have optimized here.
You are still doing multiple calls: one to get the table names and another to get the table objects. Why not get the table objects directly?
Also, have you considered the memory impact of loading everything onto the heap? You could have iterated over TableIterable instead. I don't think this is a robust solution; it can potentially lead to OOM.
cc @dengzhhu653, @wecharyu
The earlier implementation made one msc call to get the table names and then one msc call per table to get each HMS Table object.
The newer implementation reduces the msc calls: one call fetches all table names, and then, using TableIterable, the number of msc calls for fetching the table objects becomes number of tables / BATCH_MAX_RETRIEVE (config value, default 300).
So in the older implementation the number of msc calls was 1 + number of tables, whereas in the newer implementation it is 1 + (number of tables / BATCH_MAX_RETRIEVE).
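To make the batching concrete, here is a minimal sketch of that idea; it is not the actual TableFetcher code. It assumes an IMetaStoreClient handle (msc), that the matching table names were already fetched with a single call, and a batch size corresponding to the BATCH_MAX_RETRIEVE-style setting (default 300); the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.thrift.TException;

public final class BatchedFetchSketch {
  // Fetch Table objects in batches of `batchSize` names, so the total number of
  // msc calls is roughly 1 (for the names) + ceil(numTables / batchSize)
  // instead of 1 + numTables.
  static List<Table> fetchTablesInBatches(IMetaStoreClient msc, String dbName,
      List<String> tableNames, int batchSize) throws TException {
    List<Table> tables = new ArrayList<>(tableNames.size());
    for (int start = 0; start < tableNames.size(); start += batchSize) {
      int end = Math.min(start + batchSize, tableNames.size());
      // One msc call per batch of names rather than one call per table.
      tables.addAll(msc.getTableObjectsByName(dbName, tableNames.subList(start, end)));
    }
    return tables;
  }
}
```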
In my earlier implementation I had the same proposal of getting table objects directly, where I had implemented a dedicated HMS API endpoint like listTableNamesByFilter, but the idea was dropped by @vikramahuja1001.
In order to use batching, you need to have the list of tables to fetch, and that's OK. However, instead of working with the batches, you load everything into memory.
Could you refactor to use Iterable (i.e. make getTables return Iterable<Table>)?
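For reference, a rough sketch of what an Iterable-based getTables could look like, so that only one batch of Table objects is resident in memory at a time. This is a simplified, standalone illustration (the class name is hypothetical; Hive's own TableIterable utility mentioned above covers the same pattern).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

// Hypothetical lazy wrapper: table objects are fetched batch by batch while
// iterating, so the full set of Table objects never sits on the heap at once.
class BatchedTableIterable implements Iterable<Table> {
  private final IMetaStoreClient msc;
  private final String dbName;
  private final List<String> tableNames;
  private final int batchSize;

  BatchedTableIterable(IMetaStoreClient msc, String dbName, List<String> tableNames, int batchSize) {
    this.msc = msc;
    this.dbName = dbName;
    this.tableNames = tableNames;
    this.batchSize = batchSize;
  }

  @Override
  public Iterator<Table> iterator() {
    return new Iterator<Table>() {
      private int nextIndex = 0;
      private Iterator<Table> currentBatch = Collections.emptyIterator();

      @Override
      public boolean hasNext() {
        while (!currentBatch.hasNext() && nextIndex < tableNames.size()) {
          int end = Math.min(nextIndex + batchSize, tableNames.size());
          try {
            // One metastore call per batch; earlier batches become garbage-collectable.
            currentBatch = new ArrayList<>(
                msc.getTableObjectsByName(dbName, tableNames.subList(nextIndex, end))).iterator();
          } catch (Exception e) {
            throw new RuntimeException("Failed to fetch a batch of tables", e);
          }
          nextIndex = end;
        }
        return currentBatch.hasNext();
      }

      @Override
      public Table next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return currentBatch.next();
      }
    };
  }
}
```

A caller such as IcebergHouseKeeperService could then stream over the returned Iterable<Table> instead of materializing the whole list.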
Okay, so for this fix should I create a new JIRA, or since I am working on https://issues.apache.org/jira/browse/HIVE-28974, which is related to IcebergHouseKeeperService only, should I attach the fix to that JIRA?
Whatever you suggest is fine with me.
+1, using the TableIterator is more reasonable to avoid a possible OOM.
Let me summarize the points.
- This PR would reduce the # of API calls from O(num-tables) to O(num-tables / batch size). It's neat 👍
- This PR would require O(num-tables * tbl-size) space. We'd like to reduce it.
- Regardless of this PR, we retain the O(num-tables * length-of-table-name) space. Is that OK, or should we optimize it?
AFAIK, it's a tradeoff between the number of msc calls and the space.
If we try to decrease the number of tables stored in memory, we would increase the number of msc calls, and then there would be no point to this JIRA.
Please correct me if I am wrong, but this is what I think.
My concern was related to the fetch logic, where we load all Hive table objects into memory instead of using the batch iterator.
We also make O(num-tables) calls to load the Iceberg tables. Can we optimize here? Then we put those into a separate cache. Maybe we could use CachingCatalog instead?
tableCache.get(tableName, key -> IcebergTableUtil.getTable(conf, table))
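One possible shape for such a cache, sketched with Caffeine. The class and field names, size bound, and TTL are illustrative, the IcebergTableUtil.getTable(conf, hmsTable) call mirrors the snippet quoted above (its package is assumed here), and Iceberg's CachingCatalog would be an alternative to rolling this by hand.

```java
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.mr.hive.IcebergTableUtil; // package assumed

// Hypothetical caching layer so repeated housekeeping passes do not reload
// the same Iceberg table metadata over and over.
class IcebergTableCache {
  private final Cache<String, Table> tableCache = Caffeine.newBuilder()
      .maximumSize(1_000)                      // illustrative bound
      .expireAfterWrite(10, TimeUnit.MINUTES)  // illustrative TTL
      .build();

  Table getOrLoad(Configuration conf, org.apache.hadoop.hive.metastore.api.Table hmsTable) {
    String key = hmsTable.getDbName() + "." + hmsTable.getTableName();
    // The loader runs only on a cache miss; otherwise the cached Iceberg table is reused.
    return tableCache.get(key, k -> IcebergTableUtil.getTable(conf, hmsTable));
  }
}
```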
    catalogName, dbPattern, tablePattern, e);
}
for (org.apache.hadoop.hive.metastore.api.Table table : tables) {
  expireSnapshotsForTable(getIcebergTable(table));
I recalled that we would like to retain the try-catch. We intentionally added it to avoid skipping everything when a single expiration fails.
See also: #5786 (comment)
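For illustration, the guarded loop would keep roughly this shape (a sketch, not the exact patch; LOG is assumed to be an SLF4J logger, and expireSnapshotsForTable / getIcebergTable are the methods visible in the diff above):

```java
for (org.apache.hadoop.hive.metastore.api.Table table : tables) {
  try {
    expireSnapshotsForTable(getIcebergTable(table));
  } catch (Exception e) {
    // A failure on one table is logged and the loop moves on, so a single bad
    // table does not stop snapshot expiration for the remaining tables.
    LOG.error("Failed to expire snapshots for {}.{}", table.getDbName(), table.getTableName(), e);
  }
}
```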
@Neer393 @deniskuzZ I created a revert PR because we found two issues to be discussed.
Okay. In that case let me close the redundant JIRA that I created for fixing this.
@Neer393 HIVE-28952 addendum should be OK.
Okay, in that case @okumin please do not revert the PR. As per @deniskuzZ I would put an addendum to it. @deniskuzZ, using ... I don't think we require TableIterator as such for TableOptimizer, but if you say so I can add it. I'd say the call is yours.
@Neer393 |
Oh okay. In that case yes.
What changes were proposed in this pull request?
Modified TableFetcher to return Table objects instead of names. This reduces the number of msc calls.
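For example, with a hypothetical 3,000 matching tables and the default BATCH_MAX_RETRIEVE of 300, the previous code issued 1 + 3,000 = 3,001 msc calls, while the batched fetch issues 1 + 3,000 / 300 = 11 calls.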
Why are the changes needed?
It's an improvement that reduces the number of msc calls.
Does this PR introduce any user-facing change?
No, it just adds a new method to get tables as objects.
How was this patch tested?
Locally, by executing the unit tests.